DRAFT DRAFT DRAFT
(Playing along at home? Here’s what you’ll need:)
(.packages())[1] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
[7] "base"
stats
graphics
grDevices
utils
datasets
methods
base
select <- dplyr::selectSo, as you may or may not know, Uber and Lyft, America’s two premiere ridesharing services, stopped offering rides in Austin, TX as of May 9, 2016 after a contentious public vote which you can read about here: http://www.bizjournals.com/austin/news/2016/05/07/uber-lyft-defeated-in-prop-1-referendum.html
A common argument is that the presence of ride-sharing services such as Uber or Lyft decreases the incidence of drunken driving in a city in which they operate; it sounds reasonable enough to say that if there are more options for getting home after drinking, at least one of them might be more attractive than driving home in a potentially compromised state.
There have been a few articles and a couple papers studying the effect of the availability of ridesharing services on drunk driving (see references below), but in my opinion they fall short in one or more ways:
They fail to count all available records of drunk driving (like if, say, a study only counted drunk driving fatalities) They fail to account for trends in drunk driving frequency (like if, say, more people than average drove drunk on certain holidays), or aggregate time-series data too broadly They fail to control for exogenous effects on drunk driving frequency (like if, say, the population of a core drinking demographic increased) They fail to properly treat the regression discontinuity between data before and after ridesharing was available in Austin (more on this later)
I’ve managed to get ahold of three datasets that I think will be helpful:
The APD (Austin Police Department) Incident Reports are publicly available for years 2008-2011, and for ‘YTD’ which supposedly includes the last 18 months of reported crime in Austin but actually only includes 8 months of 2016 and a small number of reports from 2015. I was able to obtain an archived version of the data from late 2015, which has more complete data for Jan-Jul 2015.
The APD Chief’s Monthly Report is a high-level summary of crime in Austin for the month, including counts for high-profile crimes including DWI. It has no time-of-day or location information, and no info other than just that a DWI occurred, but is another source of data to compare to the incident reports, which may be spotty. It additionally contains data on narcotics arrests, which we can check to see if we can use it as a control variable - narcotics crimes may follow the same broad trends as DWI, but are potentially unaffected by ridesharing.
It’s worth noting that the study of DWI arrest numbers as a proxy for the incidence of drunk driving is an open area of research - some studies prefer drunk driving crashes, although Austin has not released enough monthly statistics on drunk driving crashes to make it a usable variable in these models. We’ll assume DWI arrests are a viable proxy for drunk driving here.
The TABC (Texas Alcoholic Beverage Commission) Mixed Beverage Tax is a fixed tax charged on alcoholic beverages sold by a holder of a Mixed Beverage License in the state of Texas. I suspect that Mixed Beverage Tax revenues are correlated with Units of Alcohol Consumed in Austin, and we’ll check to see if any relationship exists between these tax revenues and DWI counts.
March 2014 : Uber is around for SXSW.
June 3, 2014: Lyft launches in Austin. (https://twitter.com/AustinLyft/status/473857057720766464)
June 4, 2014 : Uber launches in Austin. (https://twitter.com/Uber_ATX/status/474199740217716736)
May 9, 2016: Uber and Lyft stop giving rides in Austin.
We’ll call June 4 and May 9 treatment dates - the inflection points we’ll investigate to see if the data on either side is substantially different.
You can find the same data I used here: * https://data.austintexas.gov/browse?q=apd&sortBy=relevance * https://www.comptroller.texas.gov/transparency/open-data/search-datasets/ * https://austintexas.gov/page/chiefs-monthly-reports * https://durs.carto.com/tables/apd_incident_extract_2015csv/public
Or you can pull it from my Github, http://www.github.com/ianwells
Alright, Let’s get to it.
Here’s a quick glimpse of what the raw APD data looks like:
sample_n(dwi.raw,10)Cool. We have a bunch of crimes, pre-filtered to intoxication-related ones and the time they approximately occurred.
You can check this against the full list of crimes available to commit in Austin, which is available on that same Austin Data Portal site, I’m not including it here because it’s quite long, please trust that I’ve got all of them.
levels(dwi.raw$crime)[1] "CRASH/INTOX MANSLAUGHTER" "CRASH/INTOXICATION ASSAULT"
[3] "DRIVING WHILE INTOX / FELONY" "DUI - AGE 17 TO 20"
[5] "DWI" "DWI .15 BAC OR ABOVE"
[7] "DWI 2ND"
CRASH/INTOX MANSLAUGHTER
CRASH/INTOXICATION ASSAULT
DRIVING WHILE INTOX / FELONY
DUI - AGE 17 TO 20
DWI
DWI .15 BAC OR ABOVE
DWI 2ND
If you’re familiar with this data set you’ll notice I’ve removed a few crimes:
BOATING WHILE INTOXICATED - This rare crime doesn’t seem like it could be avoided by hailing an Uber.DWI - DRUG RECOGNITION EXPERT - You don’t have to have any alcohol in your system to be charged with a DWI: in this case, an officer certified in recognizing the effects of some other substance has decided you shouldn’t be driving. While I imagine you could avoid trouble by hailing a ride in this case, I don’t think I could relate it to bar sales, so this stat is going to be relegated to some other study.CRASH/LEAVING THE SCENE - This is a tricky one. Certainly a non-zero percentage of hit-and-run offenses are committed by drunk drivers, but this charge doesn’t offer any more details, and worse, there are more of this kind of crime than all the others put together. How should we count them, if at all? For now, let’s leave them all out, as they’re not counted in official DWI tallies. We might be able to project how many of these crimes involve alcohol at a later date, based on when and where they occurred, but we’ll leave that for another analysis.Let’s take a look at the relative frequency of these offenses. Periodicity can be exploited later on.
qplot(hour(date), data=dwi.raw, geom="histogram", bins = 24)Perhaps not surprisingly, nearly all DWI offenses occur between 8 PM and 5 AM.
qplot(wday(date), data=dwi.raw, geom="histogram", bins = 7)Again, looks like Fridays and Saturdays are popular days to get caught drunk driving.
We’re going to look at the whole series, and then a histogram of monthly data to see if there’s a seasonal trend (you might expect more DWI’s to occur in December, January, or July months with holiday weekends that involve a lot of drinking)
Note: we have incomplete data for July 2015 and August 2016, as well as an old offense in 2014, so we’ll remove those months now. We’ll leave 2015 and 2016 out of the monthly histogram because we only have data from a few months for those years.
dwi.raw$year <- year(dwi.raw$date)
dwi.raw$month <- month(dwi.raw$date)
dwi.monthly <- dwi.raw %>% group_by(year,month)
dwi.monthly <- summarize(dwi.monthly,count=n())
dwi.monthly$date <- as.Date(paste0(dwi.monthly$year,'-',dwi.monthly$month,'-01'))
dwi.monthly <- dwi.monthly[-56,] #remove partial July 2015 data
dwi.monthly <- dwi.monthly[-63,] #remove partial August 2016 data
dwi.monthly <- dwi.monthly[-49,] #remove lone 2014 arrest
label.months <- c('J','F','M','A','M','J','J','A','S','O','N','D')
ggplot(data=dwi.monthly, aes(x = date, y = count, group = year(date), color = factor(year(date)))) + geom_point()qplot(factor(month(date)), data=filter(dwi.monthly,date < '2015-01-01'), geom="bar", weight = count)So we can get a sense that, for the years 2008-2011, monthly DWI’s seem to be decreasing (good job APD), and as for a monthly trend we see an annual peak in December-January, as well as something going on in August-October.
Now, I’m curious to see how the Chief’s Monthly Report compares.
apd.raw$date <- as.Date(apd.raw$date)
apd.raw$year <- year(apd.raw$date)
apd.raw$month <- month(apd.raw$date)
apd.monthly <- apd.raw
ggplot(data=apd.monthly, aes(x = date, y = count, color = factor(year))) + geom_point() qplot(factor(month(date)), data=apd.monthly, geom="bar", weight = count)This has a similar December-January peak. Let’s plot both at the same time.
both.monthly <- merge(apd.monthly,dwi.monthly, by = 'date', all = TRUE);
both.monthly$apd <- both.monthly$count.x
both.monthly$dwi <- both.monthly$count.y
both.monthly <- both.monthly %>% select(date,apd,dwi)
both.melted <-melt(both.monthly, id.vars="date")
ggplot(data=both.melted, mapping = aes(x=date,y=value, color = variable)) + geom_point()Warning: Removed 73 rows containing missing values (geom_point).
Okay, so it looks like the chief’s report is a little different than the incident report database. That’s annoying, but based on the disclaimers on both websites, and that the chief’s report is updated more often, I’m going to call it and say the chief’s data is what I’m using moving forward. I’d like those extra two years of data that the database has though - and since the data is so historic, and the difference is so consistent, what I’m going to do is nudge the data up slightly so that it lines up.
both.monthly$diff <- both.monthly$apd - both.monthly$dwi
mean(filter(both.monthly,date < '2012-01-01', date > '2010-01-01')$diff)[1] 14.86957
both.monthly.nudge <- both.monthly
both.monthly.nudge$dwi = both.monthly.nudge$dwi + 15 #approximate mean difference
both.melted.nudge <-melt(both.monthly.nudge, id.vars="date")
ggplot(data=both.melted.nudge, mapping = aes(x=date,y=value, color = variable)) + geom_point()Warning: Removed 145 rows containing missing values (geom_point).
Alright, that’s a little cleaner. We’ll move forward with this set of data until I can get more comprehensive arrest records, being able to use weekly periodicity might be useful.
The Chief’s Monthly Report lists all Part 1 Offenses (according to the Uniform Crime Reporting standards established by the FBI, mostly violent felonies) and a few key Part 2 Offenses, namely DWI, narcotics, prostitution, and weapons-law related crimes. Prostitution and weapons crimes occur sparsely compared to DWI and narcotics, so we’ll import the narcotics arrest data in hopes that it will allow us to control for sweeping effects across all crime categories - things like the number of officers working, motivation of the police force, population of the city, etc.
ggplot(data=apd.monthly, aes(x = date, y = narc, color = factor(year))) + geom_point() qplot(factor(month(date)), data=apd.monthly, geom="bar", weight = narc)Not a lot to say here - there’s a strong January but not a lot else that lines up with the DWI data. We’ll see if they’re correlated later.
I’ve saved out our final data set here to save time.
dwi <- read.csv('data/dwi-final.csv')
#sapply(dwi,class)
dwi$date <- as.Date(dwi$date)
dwi$count <- as.numeric(as.character(dwi$count)) #cast strictly
dwi$narc <- as.numeric(as.character(dwi$narc))Warning: NAs introduced by coercion
dwi$treatment <- as.factor(dwi$treatment) #0: pre uber, 1:uber, 2:post uberLet’s have a look at what’s going on with the TABC data. Now, I’ve already done a bit of work with this data set in the past (see tabcmap.s3-website-us-east-1.amazonaws.com), so I’ve just pulled values from my database of monthly revenues calculated from monthly tax receipts.
Here’s what a few rows look like:
ADJUST FOR INFLATION
sample_n(tabc.raw,10)Let’s do the same all-time and then year-over-year plots to get a sense of what’s in there.
There’s a lone data point from 1999, let’s clean that up too, and let’s scale revenues into the millions of dollars.
tabc.monthly <- tabc.raw[-40,] #remove lone 1999 point
tabc.monthly$date <- as.Date(tabc.monthly$date)
tabc.monthly$year <- year(tabc.monthly$date)
tabc.monthly$month <- month(tabc.monthly$date)
tabc.monthly$rev <- round(tabc.monthly$rev/1000000,2)
ggplot(data=tabc.monthly, aes(x = date, y = rev, group = year, color = factor(year))) + geom_point()qplot(factor(month(date)), data=tabc.monthly, geom="bar", weight = rev)Look at that growth! Notice that spike in March? That’s SXSW, a major annual music, technology, and film conference. See that spike in October? That’s Austin City Limits, a major music festival. This data is already good to go, so let’s rename it to keep things clean and keep going.
tabc <- select(tabc.monthly,date,rev)Let’s see about a new variable - DWI per Millions of Dollars Bar Revenue. I think that this variable may be useful as it incorporates things you would expect to vary with bar revenue - drinking-age population and popularity of drinking - into the DWI count. All things the same, a town with a lower DWI-per-drink-consumed number is going better than one with a higher ratio. We may be able to link ridesharing to an improvement in this ratio more convincingly than to just raw DWI’s.
dwi <- merge(dwi,tabc, by = 'date')
dwi$count_per_mm <- dwi$count/dwi$rev
ggplot(data=dwi, aes(x = date, y = count_per_mm, group = year(date), color = factor(year(date)))) + geom_point()ggplot(data=dwi, aes(x = rev, y = count, color = year(date))) + geom_point() + geom_smooth(method = "lm")summary(lm(data = dwi, count ~ rev))
Call:
lm(formula = count ~ rev, data = dwi)
Residuals:
Min 1Q Median 3Q Max
-124.302 -36.895 2.742 37.032 112.456
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 471.91537 21.21465 22.245 <2e-16 ***
rev 0.01024 0.47007 0.022 0.983
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 51.7 on 105 degrees of freedom
Multiple R-squared: 4.519e-06, Adjusted R-squared: -0.009519
F-statistic: 0.0004745 on 1 and 105 DF, p-value: 0.9827
So while DWI/$MM is clearly decreasing, it would appear that there’s no correlation between DWI counts and bar revenue (they’re independent variables). We’d maybe expect to see annual DWI spikes in March if there was, but we don’t. This is not at all what I was expecting to see. Oh well.
Alright, so now that we’ve got some data, what can we do with it? How can we establish one way or the other that the advent of ridesharing affected the number of DWI arrests in Austin? There are a few things I’m going to look at.
Holt-Winters Forecasting is a method of predicting time-series data that has strong periodic components. We saw above that our DWI data has some yearly trends in it - the data shows it, and you can even Google around for articles about DWI in Austin and see for yourself that January is a special month for the police (look for ‘no-refusal weekend’). We’ll decompose the DWI really quick and you can see for yourself:
plot(stl(ts(dwi$count, start = c(2008, 1), frequency = 12), s.window="periodic"))You can see seasonality contributes to a swing of 70 DWI arrests.
What I’m going to do is feed DWI data from January 2008 - May 2014 (the last month before Uber and Lyft were widely available in Austin) into a Holt-Winters model and forecast DWI counts for the next few months. I’m then going to compare those to the actual data, with the reasoning that since the model takes into account yearly cycles and broad trends, it will come up with a prediction for DWI counts as if ridesharing never appeared. We can call the difference the number of DWI’s ridesharing prevented, with some degree of confidence.
dwi.pre <- filter(dwi,date < '2014-05-31')
dwi.post <- filter(dwi,date > '2014-05-31')
dwi.pre.count.ts <- ts(dwi.pre$count, start = c(2008, 1), frequency = 12)
dwi.post.count.ts <- ts(dwi.post$count, start = c(2014, 6), frequency = 12)
dwi.pre.count.hw <- HoltWinters(dwi.pre.count.ts)
forecast <- predict(dwi.pre.count.hw, n.ahead = 8, prediction.interval = T, level = 0.9)
plot(dwi.pre.count.hw, forecast)
lines(dwi.post.count.ts,col='green')The forecast is pretty rough - the data may not have been periodic enough for this model to be effective. For the first 8 months Uber and Lyft were available in Austin, the real cumulative DWI count was 158 less than the predicted count - about 4%. For the first 3 months, where our confidence interval is much tighter, we’re only 24 DWI’s less than predicted - a decrease of 1.5%.
Let’s try the other side - fit a model from data starting from Uber’s first month, and then stopping on Uber’s last month. If the prediction is lower than the real data, we might be able to say that ridesharing was having an effect, but there was some other factor contaminating the data from 2008-2014 (say an organized campaign against drunk driving or something).
We’re going to cheat a tiny bit because we need another sample of data (Holt-Winters requires 2 full years of data here) - so I’m going to say Uber actually opened for business a month earlier than it did. Hopefully the first month was relatively slow as people started adopting to ridesharing, and it won’t make much of a difference to stretch the effect out over one more month in addition the 23 months Uber and Lyft were in Austin.
dwi.during <- dwi %>% filter(date >= '2014-5-01') %>% filter(date < '2016-05-01')
dwi.exit <- filter(dwi,date >= '2016-05-01')
dwi.during.count.ts <- ts(dwi.during$count, start = c(2014, 5), frequency = 12)
dwi.exit.count.ts <- ts(dwi.exit$count, start = c(2016, 5), frequency = 12)
dwi.during.count.hw <- HoltWinters(dwi.during.count.ts)
forecast <- predict(dwi.during.count.hw, n.ahead = 6, prediction.interval = T, level = 0.9)
plot(dwi.during.count.hw, forecast)
lines(dwi.exit.count.ts,col='green')This looks better - the model fits better, the confidence interval is tighter, and the treatment effect is 80 DWI’s over the first 3 months, a decrease of about 6%. In other words, not having Uber and Lyft may have resulted in an additional 80 arrests. Maybe more convincing than the initial effect, but let’s see if we can do better.
Google has a great library for testing the causal effect of a given treatment on a time series. Designed for testing the effectiveness of advertising, I think this library will help us out - it’s designed for inference on non-experimental data, which is what we’ve got.
Typically, establishing causation requires good experimental design. We don’t have the resources to do a city-scale ridesharing/drunk-driving experiment, so we’ll have to use some fancy math and our existing data.
For this, we require a response and predictor - let’s use DWI count as a response, and TABC revenue as a control, as it appears TABC sales growth is linear through the period Uber was available in Austin.
I’m going to take a relatively well-behaved stretch of data around the treatment - let’s look at the time-series again, with the treatment area in grey:
uber <- geom_rect(mapping = aes(xmin=as.Date('2014-06-01'), xmax = as.Date('2016-05-01')), ymin=-Inf, ymax=+Inf, color="grey20", alpha=0.3, inherit.aes = FALSE)
ggplot(data=dwi, mapping = aes(x=date,y=count, color = year(date))) + uber + geom_point() + geom_smooth(se = FALSE)`geom_smooth()` using method = 'loess'
It appears that there’s an inflection around March 2012 - I don’t know what caused it, and it happened before Uber showed up, so whatever it is it’s outside the scope of this model. CausalImpact should be able to take the downward trend in DWI and isolate the effect of the treatment in June 2014, if it exists.
We’re also limiting the TABC data, so let’s look at that, and the correlation for the time frame:
ggplot(data=filter(filter(dwi, date > '2012-04-01'),date < '2016-05-09'), mapping = aes(x=date,y=rev, color = year(date))) + uber + geom_point() + geom_smooth(method = 'lm', se = FALSE)ggplot(data=filter(filter(dwi, date > '2012-04-01'),date < '2016-05-09'), mapping = aes(x=rev,y=count)) + geom_point() + geom_smooth(method = 'lm')summary(lm(data=filter(filter(dwi, date > '2012-04-01'),date < '2016-05-09'), count ~ rev))
Call:
lm(formula = count ~ rev, data = filter(filter(dwi, date > "2012-04-01"),
date < "2016-05-09"))
Residuals:
Min 1Q Median 3Q Max
-82.122 -26.663 -2.639 23.013 118.208
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 591.9698 38.5197 15.368 <2e-16 ***
rev -2.0401 0.7402 -2.756 0.0083 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 37.94 on 47 degrees of freedom
Multiple R-squared: 0.1391, Adjusted R-squared: 0.1208
F-statistic: 7.597 on 1 and 47 DF, p-value: 0.008297
What we care about is that there is a correlation between TABC revenue and DWI arrests, and that there does not appear to be a discontinuity in the TABC revenue around when we expect a discontinuity to appear in the DWI data. The correlation exists, but it is negative, and although it looks like TABC sales are unaffected by ridesharing, we can do a quick check with our Holt-Winters method.
(If you’re running this at home note that you’ll need the development version of devtools to install CausalImpact due to a bug in the CRAN version of devtools. Get it with devtools::install_github("hadley/devtools") )
dwi.ci <- select(dwi,date,count,rev)
pre.ci <- as.Date(c("2012-04-01", "2014-05-31"))
post.ci <- as.Date(c("2014-06-01", "2016-05-09"))
impact <- CausalImpact(dwi.ci, pre.ci, post.ci, model.args = list( prior.level.sd = 0.1,nseasons = 12, season.duration = 1))
plot(impact)Warning: Removed 107 rows containing missing values (geom_path).
Warning: Removed 57 rows containing missing values (geom_path).
Warning: Removed 214 rows containing missing values (geom_path).
impactPosterior inference {CausalImpact}
Average Cumulative
Actual 463 11114
Prediction (s.d.) 515 (19) 12370 (454)
95% CI [480, 553] [11517, 13263]
Absolute effect (s.d.) -52 (19) -1256 (454)
95% CI [-90, -17] [-2149, -403]
Relative effect (s.d.) -10% (3.7%) -10% (3.7%)
95% CI [-17%, -3.3%] [-17%, -3.3%]
Posterior tail-area probability p: 0.004
Posterior prob. of a causal effect: 99.6%
For more details, type: summary(impact, "report")
CausalImpact attributes a 10% drop in DWI’s to ridesharing.
UNDERSTAND HOW THIS WORKS MORE AND TALK ABOUT IT
Let’s look and see if Uber and Lyft leaving had any effect.
dwi.ci <- select(dwi,date,count,rev)
pre.ci <- as.Date(c("2014-05-31", "2016-05-09"))
post.ci <- as.Date(c("2016-05-10", "2016-12-31"))
impact <- CausalImpact(dwi.ci, pre.ci, post.ci,model.args = list( prior.level.sd = 0.1))Warning in FormatInputPrePostPeriod(pre.period, post.period, data): Setting
post.period[2] to end of data: 2016-11-01
plot(impact)Warning: Removed 107 rows containing missing values (geom_path).
Warning: Removed 77 rows containing missing values (geom_path).
Warning: Removed 214 rows containing missing values (geom_path).
impactPosterior inference {CausalImpact}
Average Cumulative
Actual 435 2608
Prediction (s.d.) 454 (13) 2722 (81)
95% CI [427, 481] [2563, 2888]
Absolute effect (s.d.) -19 (13) -114 (81)
95% CI [-47, 7.5] [-280, 44.9]
Relative effect (s.d.) -4.2% (3%) -4.2% (3%)
95% CI [-10%, 1.6%] [-10%, 1.6%]
Posterior tail-area probability p: 0.071
Posterior prob. of a causal effect: 93%
For more details, type: summary(impact, "report")
According to this, voting out Uber and Lyft led to a 5% drop in DWI’s in the second half of 2016, which seems counter-intuitive.
It’s possible that TABC revenues are a bad control variable here, let’s see if narcotics arrests are better.
ggplot(data=filter(filter(dwi, date > '2012-04-01'),date < '2016-05-09'), mapping = aes(x=date,y=narc, color = year(date))) + uber + geom_point() + geom_smooth(method = 'lm', se = FALSE)ggplot(data=filter(filter(dwi, date > '2012-04-01'),date < '2016-05-09'), mapping = aes(x=narc,y=count)) + geom_point() + geom_smooth(method = 'lm')summary(lm(data=filter(filter(dwi, date > '2012-07-01'),date < '2016-05-09'), count ~ narc))
Call:
lm(formula = count ~ narc, data = filter(filter(dwi, date > "2012-07-01"),
date < "2016-05-09"))
Residuals:
Min 1Q Median 3Q Max
-100.861 -24.899 5.067 23.005 79.604
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 248.3555 68.4290 3.629 0.000736 ***
narc 0.4557 0.1305 3.491 0.001106 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 36.83 on 44 degrees of freedom
Multiple R-squared: 0.217, Adjusted R-squared: 0.1992
F-statistic: 12.19 on 1 and 44 DF, p-value: 0.001106
In this case, we have a positive correlation, and again no obvious discontinuity.
And our impact test with narcotics arrests as a control:
dwi.ci.n <- filter(select(dwi,date,count,narc),date > '2012-04-01')
pre.ci <- as.Date(c("2012-04-01", "2014-05-31"))
post.ci <- as.Date(c("2014-06-01", "2016-05-09"))
impact.n <- CausalImpact(dwi.ci.n, pre.ci, post.ci,model.args = list(prior.level.sd = 0.1,nseasons = 12, season.duration = 1))Warning in FormatInputPrePostPeriod(pre.period, post.period, data): Setting
pre.period[1] to start of data: 2012-05-01
plot(impact.n)Warning: Removed 55 rows containing missing values (geom_path).
Warning: Removed 6 rows containing missing values (geom_path).
Warning: Removed 110 rows containing missing values (geom_path).
impact.nPosterior inference {CausalImpact}
Average Cumulative
Actual 463 11114
Prediction (s.d.) 498 (14) 11964 (335)
95% CI [471, 526] [11315, 12616]
Absolute effect (s.d.) -35 (14) -850 (335)
95% CI [-63, -8.4] [-1502, -200.6]
Relative effect (s.d.) -7.1% (2.8%) -7.1% (2.8%)
95% CI [-13%, -1.7%] [-13%, -1.7%]
Posterior tail-area probability p: 0.006
Posterior prob. of a causal effect: 99.4%
For more details, type: summary(impact, "report")
Using narcotics arrests as a control, ridesharing can account for a 7% drop in DWI arrests.
Regression Discontinuity Design is another method of establishing causality without the benefit of experimental design. The key assumption is that, with regards to a time series, the data on either side of a treatment inflection are otherwise similar - nothing else special has happened. We’ll look at regressions on either side of the treatment and see how they look - first with just LM and then with a specialized library.
ggplot(data = filter(dwi,date > '2012-04-01'),mapping = aes(date, count, color = treatment) )+ geom_point() + geom_smooth(method = 'lm')Visually, you might get a sense of the effect of ridesharing over the period marked in green, but let’s keep looking.
dwi$index <- seq.int(nrow(dwi))
dwi.w <- filter(dwi, date < '2016-05-01', date > '2012-04-1')
dwi.wo <- filter(dwi, date < '2016-12-31', date > '2014-05-30')
dwi.rdd.w <- rddtools::rdd_data(y = dwi.w$count,x = dwi.w$index,cutpoint = 77)
dwi.rdd.wo <- rddtools::rdd_data(y = dwi.wo$count,x = dwi.wo$index,cutpoint = 101)
reg_para.w <- rdd_reg_lm(rdd_object = dwi.rdd.w, order = 1)
reg_para.wo <- rdd_reg_lm(rdd_object = dwi.rdd.wo, order = 1)
print(reg_para.w)### RDD regression: parametric ###
Polynomial order: 1
Slopes: separate
Number of obs: 48 (left: 24, right: 24)
Coefficient:
Estimate Std. Error t value Pr(>|t|)
D -12.439 19.264 -0.6457 0.5218
plot(reg_para.w)print(reg_para.wo)### RDD regression: parametric ###
Polynomial order: 1
Slopes: separate
Number of obs: 30 (left: 23, right: 7)
Coefficient:
Estimate Std. Error t value Pr(>|t|)
D -27.432 26.477 -1.0361 0.3097
plot(reg_para.wo)Checking the p-values for the regressions on either side of both discontinuities reveals that both aren’t significantly different from random chance. The second discontinuity - while much sharper - is based on too little data to conclude anything sharply, although it is certainly suggestive.
Our conclusions were a bit mixed - DWI per $ tax revenue ended up being not useful, the data was in some cases too noisy for a solid conclusion with Holt-Winters, CausalImpact predicted a counter-intuitive result for post-Uber DWI’s, and the data was also too noisy for a conclusive regression discontinuity analysis.
Still, our three methods did agree that ridesharing did move the needle on DWI arrests for the treatment period starting June 2014, and Holt-Winters provided a confident assessment of the period starting May 2016, which is something. I think that, anticipating further study, we can hesitantly attribute a drop in DWI arrests of approximately 5-10% to the presence of ridesharing in Austin, and an increase in DWI arrests of approximately 0-5% to the Uber/Lyft exit, after controlling for external effects.
Deterring Drunk Driving Fatalities: An Economics of Crime Perspective Bruce L. Benson and David W. Rasmussen
Inferring Causal Impact Using Bayesian Structural Time-series Models Kay H. Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, Steven L. Scott
Regression Discontinuity Designs in Economics David S. Lee and Thomas Lemieux
The MIT License (MIT)
Copyright (c) 2017 Ian Wells
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
DRAFT DRAFT DRAFT